ABSTRACT

A method for efficient browsing and searching of people in video is described. The method
consists of a pre-processing stage which automatically parses and indexes the video content
based on facial information. A face detection unit indicates a hypothesis for the presence of a
face. A face-tracking unit then tracks the detected face to extract the related video
segment. A representative set of facial viewpoints is extracted from the video segment and
characteristic facial features are stored. Each newly detected face is matched with a currently
existing face database to augment the database at an already existing entry or to introduce a
new face entry. The video face index can be displayed, edited, annotated and browsed
efficiently, by person. In searching, the face index is accessed to reduce processing time and
to increase recognition accuracy. Extracted audio characteristic data is used to validate face
matching across video scenes and to augment the face indexing data for future recognition.

DESCRIPTION OF THE BACKGROUND OF THE INVENTION

1. FIELD OF THE INVENTION 

The present invention relates to video indexing, logging, browsing and searching. The
invention focuses on the automatic parsing and indexing of high-level video content, such as
structured objects. Moreover, the invention relates the indexing scheme to the
formation of an intelligent index database for recognition tasks in querying applications.
More particularly, the invention focuses on the automatic parsing and
indexing of video content based on facial information for efficient browsing and searching of
people in video.

2. DESCRIPTION OF THE RELATED ART 

The amount of video data stored in multimedia libraries grows very rapidly, which makes
searching a time-consuming task. Both time and storage requirements can be reduced by
creating a compact representation of the video footage in the form of key-frames, that is, a
subset of the original video frames which are used as a representation for these original video
frames. Prior art focuses on key-frame extraction as basic primitives in the representation of
video for browsing and searching applications.



A system for video browsing and searching, based on key-frames, is depicted in Fig. 1A. A
video image sequence is input from a video feed module 110. The video feed may be a live
program or recorded on tape. Analog video is digitized in video digitizer 112. Optionally, the
system may receive digital representation such as Motion JPEG or MPEG directly. A user
interface console 111 is used to select program and digitization parameters as well as to
control key-frame selection. A key-frame selection module 113 receives the digitized video
image sequence. Key-frames can be selected at scene transitions by detecting cuts and gradual
transitions such as dissolves. This coarse segmentation into shots can be refined by selecting
additional key-frames in a process of tracking changes in the visual appearance of the video
along the shot. A feature extraction module 114 processes key-frames as well as non
key-frame video data to compute key-frames characteristic data. These data are stored in the
key-frames characteristic data store and are accessed by a video search engine 116 in order to
answer queries generated by the browser-searcher interface 117. Such queries may relate to
content attributes of the video data such as color, texture, motion and others. Key-frame data
can also be accessed directly by the browser 117. In browsing, a user may review key-frames
instead of the original video, thus reducing storage and bandwidth requirements.

In an edited video program, the editor switches between different scenes. Thus, a certain
collection of M video shots may consist only of N<M different scenes such that at least one
scene spans more than one shot. Prior art describes how to cluster such shots based on their
similarity of appearance for video browsing applications. It has become standard practice to
extract features such as color, texture and motion cues, together with a set of distance metrics,
and then utilize the distance metrics in the related feature spaces (respectively) for
determining the similarity between key-frames of the video contents or shots. In this scenario,
the video content is limited to the definition in the low-level feature space. What is missing in
the prior art is the automatic extraction of high-level object-related information from the
video during the indexing phase, so as to facilitate future searching applications.

In current systems for browsing and automatic searching, which are based on key-frames, the
key-frame extraction and the automatic searching are separate processes. Combining the
processes in a unified framework means taking into account high-level user queries (in search
mode) during the indexing phase. Spending more effort on a more intelligent indexing
process pays off in a shorter turnaround time in the searching process.

In automatic searching of video data by content, detecting and recognizing faces is of primary
importance for many application domains such as news. Prior art describes methods for face
detection and recognition in still images and in video image sequences.



A prior art method of face detection and recognition in video is depicted in Fig. 1B. A face
detection module (122) operates on a set of frames from the input video image sequence. This
set of frames may consist of the entire image sequence if the probability of detection is of
primary importance. However, the face content in the video does not change every frame.
Therefore, to save computational resources a subset of the original sequence may be utilized.
Such a subset may be derived by decimating the sequence in time by a fixed factor or by
selecting key-frames by key-frame extraction module 121. The detected faces are
sequentially stored in the face detection data store 123.

For each detected face, face features are extracted (124), where the features can be facial
feature templates, geometrical constraints, and global facial characteristics, such as
eigen-features and other known algorithms in the art. The face representation can be compared to a
currently awaiting search query, or to a predefined face database, or alternatively it can be
stored in a face feature database (125) for future use. By comparing face characteristic data
from the database or from a user-defined query with the face characteristic data extracted
from the video, the identity of people in the video can be established. This is done by the face
recognition module 126 and recorded in the video face recognition report 127 with the
associated confidence factor.

Several algorithms for face recognition are described in the prior art. In particular, one prior
art embodiment uses a set of geometrical features, such as nose width and length, mouth
position and chin shape. Another particular method is based on template matching. One
particular embodiment represents the query and the detected faces as a combination of
eigen-faces.

In a co-pending application by the same assignee, entitled "A method of automatic extraction
of key-frames from a video sequence", a method of key-frame extraction is described. The
application discloses a method for post-processing of the key-frame set so as to optimize the
set for face recognition. A face detection algorithm is applied to the key-frames and in the
case of a possible face detected, the position of that key-frame along the time axis is possibly
modified to allow a locally better view of the face. This application does not teach how to link
between different views of the same person or how to globally optimize the different views
retained of that person.



Figure 1C shows a simple sequence of video scenes and the associated face content, or lack of
it. In this example, some people appear in several scenes. Additionally, some scenes have
more than one person depicted. Figure 1D depicts the results of a sequential face-indexing
scheme such as the one depicted in Figure 1B. Clearly this representation provides a highly
redundant description of the face content of the video scenes. In future recognition tasks, each
frame will be processed independently.

In a dynamic scene, a person may be visible for only a part or several parts of the scene.
During a scene and across the scenes a person generally has many redundant views, but also
several different views. In such a situation it is desirable to prune redundant views on the one
hand, yet to increase the recognition robustness by comparing the user-defined query against
all the available different views of the same person.

Also during a video segment a person may go from a position where he can be detected to a
position where he is visible but cannot be detected by automatic processing. In several
applications it is useful to report the full segment of visibility for each recognized person.

Prior art does not teach how to detect and index face instances in a video sequence to support
these desirable features. In particular, prior art does not teach how to parse an input video
stream into face segments and how to link between similar face segments. Furthermore, prior
art does not teach how to extract a representative face index and face frame-set, with multiple
views for each person, such that face regions can be later detected and recognized with high
probability of success.

SUMMARY OF THE INVENTION 

The general problem solved by this invention is that of parsing a video stream into face/no
face segments, indexing and logging the facial content, and utilizing the indexed content as an
intelligent facial database for future facial content queries of the video data.

The invention introduces the use of a high-level visual module in the indexing of a video
stream, specifically, the use of human facial information.

It is an object of the invention to provide an output facial index of the video. Another object
of the invention is to provide an output log for the detected faces. A still further object is to
provide a facial database that accommodates future video search via facial queries into the
video archive.

A further object of the invention is to significantly increase the speed of the recognition. Yet
another object of the invention is to improve the probability of recognition.



BRIEF DESCRIPTION OF THE DRAWINGS 

Figure 1A describes a prior art method of searching in video by key-frame selection and
feature extraction.

Figure 1B describes a prior art method of sequential face indexing and face recognition.

Figure 1C describes a sample sequence of video content.

Figure 1D presents the results of face detection applied to the arrangement of Figure 1C,
organized as a sequential index.

Figure 2 describes the browsing and searching system with face indexing, as introduced in
this invention.

Figure 3 presents the face index results as pertaining to the example of Figure 1C and as
generated by the present invention.

Figure 4 shows a preferred embodiment of a face-track data structure.

Figure 4a shows a set of face characteristic views selected as taught by the present invention
from a face track.

Figure 5 describes the process of generating a face track and extracting associated
characteristic views and characteristic data.

Figure 6 describes the overall framework for generating tracks of faces from a video image
sequence.

Figure 7 describes the processing flow for tracking a single detection result.

Figure 8 describes a scheme of correlation tracking for the eyes.

Figure 9 shows a preferred embodiment for selecting Face_Characteristic_Views.

Figure 10 shows the extraction of audio data corresponding (timewise) to the video data.

Figure 11 shows how to combine the audio track with the face index to create an audio-visual
index.

Figure 12 describes the linking and information merging stage as part of the face indexing
process.



DETAILED DESCRIPTION OF THE INVENTION

It is the purpose of the present invention to teach a method of generating an indexed database
of faces that accommodates face (people)-based queries in video search applications.

A system for video browsing and searching of face content (or any other high-level object) is
depicted in Fig. 2. A video image sequence is input from a video feed module. The video
feed may be a live program or recorded on tape. Analog video is digitized in video digitizer
215. Optionally, the system may receive digital representation directly. The video source, the
program selection and digitization parameters and face-indexing selection parameters are all
controlled by the user from an interface console 230. A subset of the video sequence is input
to an indexing module 220, which is implemented as taught by the present invention.
Computed face indexing data are stored in the face indexing data store 250. A graphical
representation of the face indexing data can be displayed and browsed by browser-searcher
module 240. Such a graphical representation is depicted in Figure 3. Additionally, the face
indexing data can be edited by a face index editor (270) to correct possible errors that may
occur during automatic indexing. Such errors can originate from false face detection, that is,
identifying non-face regions as faces. An additional form of error is an over-segmentation of a
certain face: two or more instances of the same face fail to be linked between appearances and
thus generate two or more index entries. These errors are handled reasonably by any
recognition scheme: false faces will generally not be recognized and over-segmentation will
result in some additional processing time and possibly reduced robustness. In the
applications in which the generated index will be queried frequently, it is cost-effective to let
an editor review the graphical representation of the face index, to delete false alarms and to
merge entries originating from the same person.

In a preferred embodiment the editor can annotate the face index by specifying the name of
the person or linking between the face instance and another database entity. This embodiment
provides a method of semi-automatic annotation of video by first generating a face-index as in
Figure 3 and then manually annotating only the index entries. Thus, the annotation becomes
immediately linked to all tracks in the video which correspond to the specific index entry.
Since the number of frames where a specific person is visible is much larger than the number
of index entries and since the spatial information, that is, location within the frame, is readily
available, a major effort saving is achieved.

Once the face indexing data is stored, a video search engine 250 can access it in order to
process face-based queries.

Figure 3 depicts a sample face index that is generated by a particular embodiment of the
present invention for the example depicted in Figure 1C.



The face index information, as extracted following face detection and face tracking modules,
is stored in a general data structure. Figure 4 shows a preferred embodiment of a face track
data structure. Processing a face track consists of tracking the face between detection frames
to produce contiguous video segments [Fs, Fe] as well as face track coordinates to be
associated with each frame in a video segment. In addition, the face track data structure
includes:

• Face Characteristic Views, which are the visually distinct instances of the face in the
track.

• Face Frontal Views, which are those Face Characteristic Views classified as frontal.
Frontal faces have better chances of being recognized properly.

• Face Characteristic Data, that is, attributes computed from the face image data and stored
for later comparison with the corresponding attributes extracted from the query image. In
a preferred embodiment face characteristic data include eye, nose and mouth templates.
In another preferred embodiment face characteristic data include image coordinates of
geometric face features. In another preferred embodiment, face characteristic data
include coefficients of the eigen-face representation.

• Audio characteristic data that can be associated with the face track.

Figure 4a shows a set of face characteristic views selected as taught by the present invention
from a face track. A star denotes face frontal views selected as taught by the present
invention.


Figure 5 shows a process of generating a face track and the associated characteristic data from
a video image sequence. The processing steps can be initiated at each frame of the video or
more likely at a sub-set of the video frames, selected by module 510. The sub-set can be
obtained by even sampling or by a key-frame selection process.

In each frame of the subset a face-like region detection method (520) is applied. Preferably,
this detection method as taught by prior art locates facial features. Such features generally
consist of eye features and additionally mouth features and nose features.

In Figure 5, the present invention teaches the formation of a face track structure for a single
face region. Starting with grouped facial features as output by 520, these features are tracked
over time (that is, from frame to frame). Preferably, the facial features are tracked from frame
to frame (530) by correlation tracking as known in prior art. Both forward and backward
tracking is used to extract the entire face segment. The result of face tracking is a face track,
which is a video segment and also the face track coordinates.

Further processing of the video track consists of selection of Face_Frontal_Views (540),
expanding this selection to a set of Face_Characteristic_Views (550), and extracting
Face_Characteristic_Data (560).

Figures 6, 7 and 8 describe a method of generating tracks of faces from a video image
sequence. In Figure 6, a set of results from a face detection module 610 initiates tracking
processes 620 for each of the results. Each tracking process results in a single detection face
track, which consists of face location coordinates and confidence values for a range of frame
indices, which include the frame of the original detection. Since a given face may be detected
multiple times in a shot, the tracks will overlap and a merging step 630 will follow. The
uncertainty in face feature location as resulting from the tracking process is negligible with
respect to face size. Therefore, in a preferred embodiment, track merging is implemented on
the basis of spatial proximity.
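
Track merging on the basis of spatial proximity might look like the following sketch. It assumes, for illustration only, that each single-detection track is a mapping from frame index to face centre, and that two tracks show the same face when their centres stay within a hypothetical `max_dist` pixels on every shared frame; neither the representation nor the threshold comes from the text.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Track = Dict[int, Point]   # frame index -> face centre (x, y)


def same_face(a: Track, b: Track, max_dist: float) -> bool:
    """True if the tracks share frames and stay close on all of them."""
    shared = a.keys() & b.keys()
    if not shared:
        return False
    return all(math.dist(a[f], b[f]) <= max_dist for f in shared)


def merge_tracks(tracks: List[Track], max_dist: float = 10.0) -> List[Track]:
    """Merge overlapping single-detection tracks that follow the same face."""
    merged: List[Track] = []
    for track in tracks:
        current = dict(track)
        changed = True
        while changed:               # rescan, since a grown track may touch others
            changed = False
            for existing in merged:
                if same_face(existing, current, max_dist):
                    # On shared frames keep the coordinates of the earlier track.
                    current = {**current, **existing}
                    merged.remove(existing)
                    changed = True
                    break
        merged.append(current)
    return merged
```

Because the tracking uncertainty is small relative to face size, a fixed pixel threshold is a plausible simplification; scaling it by the detected face size would be a natural refinement.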

In Figure 7, the processing flow for tracking a single detection result is described. Most face
detection methods rely on the detection of facial features such as eyes. These features are a
natural choice for face tracking. However, features such as eyes are sensitive to head
orientation, blinking, etc. To make face tracking more robust with respect to such disturbances,
a head tracking scheme is utilized. The head boundary is estimated from the facial feature
data (such as eyes) and an appropriate tracking window is initiated around the head.

In steps 740, 750, 760, 770 the eyes and the head are tracked from the detection frame
forward and backward in time until either the end of shot is detected or tracking fails. The
individual tracks obtained are merged (780) to a single face track.



In Figure 8 a scheme of correlation tracking for the eyes is described. Initialized by a
detection event at frame K, the tracking reference frame R is set to K (step 810). In tracking,
the location of the tracking points is predicted based on trajectory estimates or set to the
previous place where the feature has been detected (step 830). In case of a feature-pair
such as the eyes, a similarity transformation can be derived from the two matched points (step
840). To reduce the apparent difference between the current frame and the reference frame
(due to zoom, rotation, etc.) a transformed version R* of the latter is used in the actual
correlation matching (step 860). The apparent change between the reference frame and the
current frame is repeatedly tested (step 870) and when significant, the reference frame is
updated.
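
The derivation of a similarity transformation from two matched points can be illustrated with a short sketch. Treating image points as complex numbers, a similarity (rotation, uniform scale, translation) is q = a·p + b, and two point pairs determine a and b exactly. The function names below are hypothetical and the code is only a worked example of that algebra, not the tracker of Figure 8.

```python
import cmath
from typing import Tuple

Point = Tuple[float, float]


def similarity_from_pairs(p1: Point, p2: Point, q1: Point, q2: Point):
    """Return (scale, angle_radians, a, b) such that q = a * p + b maps the
    reference points p1, p2 onto the current points q1, q2."""
    zp1, zp2 = complex(*p1), complex(*p2)
    zq1, zq2 = complex(*q1), complex(*q2)
    if zp1 == zp2:
        raise ValueError("reference points must be distinct")
    a = (zq2 - zq1) / (zp2 - zp1)   # encodes rotation and scale
    b = zq1 - a * zp1               # translation
    return abs(a), cmath.phase(a), a, b


def apply_similarity(a: complex, b: complex, p: Point) -> Point:
    """Map a reference-frame point into the current frame."""
    z = a * complex(*p) + b
    return (z.real, z.imag)
```

For example, eyes at (0, 0) and (2, 0) in the reference frame matched to (1, 1) and (1, 5) in the current frame give a scale of 2 and a rotation of 90 degrees; applying the same transformation to every pixel of the reference window would yield the transformed reference used for correlation.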



Figure 9 shows how to select the set of Face_Characteristic_Views from a face track: a video
segment, which includes location data for the facial features. Additionally, sets of
Face_Frontal_Views, which capture the most frontal viewpoint views, are selected as shown also in
Figure 4a. These views maximize the probability of recognition in a fixed number of
recognition trials. From these sets of face views a set of characterizing features for the face,
Face_Characteristic_Data, are derived.

The set of Face_Characteristic_Views is a set of face templates at varying viewpoint angles.
Contiguous frames of similar face appearance can be reduced to a single face-frame. Frames
that are different enough in the sense that they cannot be reconstructed from existing frames
via a similarity transformation are included in the set.

Figure 9 shows a preferred embodiment for selecting the Face_Characteristic_Views subject
to self-similarity criteria:

The process starts in 905 from the start frame of the track both as a reference view and as the
only member of the Face_Characteristic_Views {C}. Given the currently selected reference frame
I, the consecutive frame K is compared against I. In 920 the face motion from I to K is
computed from the matched facial features (as extracted from the face track data). In 930, the
face-like region image in K is compensated for the computed face motion. The compensated
region is then subtracted from the corresponding face image in I (940) and the difference value
is used to decide whether K is a new Face_Characteristic_View. In a second embodiment, a frame
that contains a frontal face (taken from the Face_Frontal_Views as described below) is
selected as an initializing frame and a similar process is performed on frontal faces to obtain
all Face_Frontal_Views.
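
The selection loop can be sketched as follows. To stay self-contained, the sketch assumes the face patches have already been aligned (motion compensated) to a common pose and flattened to lists of grey levels, and it uses the mean absolute difference as the difference value; the function names, the threshold, and the rule that a newly selected view becomes the next reference are assumptions for illustration.

```python
from typing import List, Sequence


def mean_abs_difference(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute grey-level difference between two equally sized patches."""
    if len(a) != len(b) or not a:
        raise ValueError("patches must be non-empty and of equal size")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def select_characteristic_views(patches: List[Sequence[float]],
                                threshold: float) -> List[int]:
    """Return indices of visually distinct patches along a face track.

    `patches` are assumed to be motion-compensated face images, one per frame.
    """
    if not patches:
        return []
    selected = [0]            # the start frame is the first characteristic view
    reference = patches[0]
    for k in range(1, len(patches)):
        if mean_abs_difference(reference, patches[k]) > threshold:
            selected.append(k)
            reference = patches[k]   # assumption: the new view becomes the reference
    return selected
```

With a threshold of 5 grey levels, a track whose appearance changes twice yields three views, while near-duplicate frames in between are pruned.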

In an anchorperson scene, the database will contain a limited set of views, as there is limited
variability in the face and its viewpoint. In a more dynamic scene, the database will contain a
large number of entries per face, encompassing the variety of viewpoints of the person in the
scene.



Using Face_Characteristic_Views captures the variability of face appearance in a relatively
small number of views and thus enables the recognition process to be less sensitive to the
following (and other) parameters:

□ sensitivity to the viewpoint angle of the face;

□ sensitivity to facial expressions, including opening vs closing of the mouth;

□ sensitivity to blinking;

□ sensitivity to external distractors, such as sunglasses.



The Face_Characteristic_Views enable the identification of dominant features of the
particular face, including (among others):

□ skin-tone coloring;

□ eye color;

□ hair shades;

□ any special marks (such as birth marks) that are consistent.



The Face_Frontal_Views is a set of the more frontal views of the face. These frames are
generally the most-recognizable frames. The selection process is implemented by symmetry-
and quality-controlled criteria:

In a preferred embodiment the score is computed from correlation values of eyes and mouth
candidate regions with at least one eye and mouth template set, respectively. In another
preferred embodiment, the quality index depends also on a face orientation score. In that
embodiment said face orientation score is computed from a mirrored correlation value of the
two eyes. In yet another embodiment, the face centerline is estimated from mouth and nose
location. In that embodiment, the face orientation score is computed from the ratio of
distances between the left/right eye to the facial centerline. In yet another embodiment, the
face quality index includes also a measure of the occlusion of the face. In that embodiment an
approximating ellipse is fitted to the head contour. The ellipse is tested for intersection with
the frame boundaries. In yet another embodiment, the ellipse is tested for intersection with
other regions.
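
The centerline-based orientation score can be made concrete with a small sketch. It assumes the centerline is the straight line through the nose and mouth locations and defines the score as the smaller eye-to-centerline distance divided by the larger, so that a frontal face scores close to 1; this normalization and the function names are illustrative choices, not taken from the text.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def distance_to_line(p: Point, a: Point, b: Point) -> float:
    """Perpendicular distance from p to the line through a and b."""
    (ax, ay), (bx, by), (px, py) = a, b, p
    length = math.hypot(bx - ax, by - ay)
    if length == 0:
        raise ValueError("centerline points must be distinct")
    cross = (bx - ax) * (py - ay) - (by - ay) * (px - ax)
    return abs(cross) / length


def orientation_score(left_eye: Point, right_eye: Point,
                      nose: Point, mouth: Point) -> float:
    """Ratio (in [0, 1]) of the eye distances to the nose-mouth centerline."""
    d_left = distance_to_line(left_eye, nose, mouth)
    d_right = distance_to_line(right_eye, nose, mouth)
    largest = max(d_left, d_right)
    if largest == 0:
        return 0.0
    return min(d_left, d_right) / largest
```

A symmetric layout such as eyes at (-2, 2) and (2, 2) with the nose at (0, 0) and mouth at (0, -2) scores 1, while a turned head with eyes at (-1, 2) and (3, 2) scores one third.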

In a preferred embodiment the process of creating a face track structure includes also the
process of computing face characteristic data. Prior art describes a variety of face recognition
methods, some based on correlation techniques between input templates, and others utilizing
distance metrics between feature sets. In order to accommodate the recognition process, a set
of face characteristic data is extracted from the Face_Characteristic_Views.



In a preferred embodiment Face_Characteristic_Data include:

• Global information, which consists of face templates at selected viewpoints, as well as
facial component templates, in one implementation of eyes, nose and mouth.

• Facial feature geometrical information indicative of the relationships between the
facial components;

• Unique characteristics, such as eyeglasses, beard, baldness, hair color;

• Audio characteristic data, denoted Fa.



Many video sequences include multiple faces. For the case of more than one face candidate
detected in frame T, a locality check is pursued to check that the candidates are sufficiently
distant and do, in fact, represent separate faces. Each face-like region is then tracked, a
face-segment is extracted and a face-frame set is selected, as described above.

Following the definition of a face segment, audio information in the form of the audio
characteristic data, Fa, is incorporated as an additional informative characteristic for the segment.
It is a purpose of the present invention to associate audio characteristic data with a face track
or part of a face track. By combining the results of visual-based face recognition and
audio-based speaker identification, the overall recognition accuracy can be improved.

Figure 10 shows a timeline and video and audio data, which correspond to that timeline. The
face/no-face segmentation of the video stream serves as a master temporal segmentation that
is applied to the audio stream. The audio segments derived can be used to enhance the
recognition capability of any person recognition system built on top of the indexing system,
which is taught in the present invention.
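
Applying the video segmentation to the audio stream amounts to converting frame ranges into sample ranges. The sketch below assumes a constant video frame rate and audio sample rate and a shared time origin; the function name and parameters are hypothetical.

```python
from typing import List, Tuple


def face_segments_to_audio(segments: List[Tuple[int, int]],
                           fps: float,
                           sample_rate: int) -> List[Tuple[int, int]]:
    """Map inclusive frame ranges [Fs, Fe] to half-open audio sample ranges."""
    if fps <= 0 or sample_rate <= 0:
        raise ValueError("fps and sample_rate must be positive")
    ranges = []
    for start_frame, end_frame in segments:
        if end_frame < start_frame:
            raise ValueError("segment end precedes its start")
        first = round(start_frame / fps * sample_rate)
        # The segment lasts until the end of its final frame.
        last = round((end_frame + 1) / fps * sample_rate)
        ranges.append((first, last))
    return ranges
```

At 25 frames per second and 16 kHz audio, the face segment covering frames 50 to 99 corresponds to audio samples 32000 up to (but not including) 64000.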

It is a further purpose of the present invention to match audio characteristic data, which
correspond to two different face tracks, in order to confirm the identity of the face tracks.
Therefore, the present invention utilizes prior art methods of audio characterization and
speaker segmentation. The latter is required for the case where the audio may correspond to
at least two speakers.

Figure 11 shows how to combine the audio track with the face index to create an audio-visual
index. The present invention uses prior art methods in speech processing and speaker
identification.

Prior art models speakers using a network of hidden Markov models (HMM). Each speaker is
modeled using an HMM consisting of states corresponding to the acoustic patterns produced
by the speaker. In addition to modeling speakers, HMMs can also be used to model silence
and non-speech signals such as laughter.

In a preferred embodiment no prior knowledge of the speakers is assumed. In that
embodiment unsupervised speaker segmentation is done using an iterative algorithm.
Parameters for the speaker HMMs are first initialized using a clustering procedure and then
iteratively improved using the Viterbi algorithm.
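
As background for the segmentation step, the decoding pass of the Viterbi algorithm can be sketched as below. This is a generic textbook decoder over discrete observation symbols, shown under the simplifying assumption of one HMM state per speaker; it is not the prior-art speaker segmentation system itself, and the re-estimation of HMM parameters between passes is omitted. All probabilities are assumed to be strictly positive.

```python
import math
from typing import Dict, List


def viterbi(observations: List[str],
            states: List[str],
            start_p: Dict[str, float],
            trans_p: Dict[str, Dict[str, float]],
            emit_p: Dict[str, Dict[str, float]]) -> List[str]:
    """Most likely state (speaker) sequence for a list of discrete observations."""
    if not observations:
        return []
    log = math.log
    # score[s] = best log-probability of any path ending in state s
    score = {s: log(start_p[s]) + log(emit_p[s][observations[0]]) for s in states}
    back: List[Dict[str, str]] = []
    for obs in observations[1:]:
        new_score, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: score[p] + log(trans_p[p][s]))
            new_score[s] = score[prev] + log(trans_p[prev][s]) + log(emit_p[s][obs])
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path
```

In an iterative scheme, the decoded labels would be used to re-estimate each speaker's model before decoding again, until the segmentation stops changing.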

In a preferred embodiment, the segmentation of the audio track is aided by visual cues from
the face indexing process 1130. In particular, the audio is partitioned with respect to the audio
content (1130). For example, when an entire shot includes a single face track, the initial
hypothesis can be a single speaker. Once verified, the audio characteristic data (such as the
HMM parameters) are associated with that face. In another example, when an entire shot
includes only two faces, the initial hypothesis can be two speakers. Once verified, the audio
characteristic data of a speech segment are associated with the face of highest mouth activity
as computed by the visual mouth activity detector 1120. In yet another embodiment, audio
characteristic data are matched against mouth movement.
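
Associating a speech segment with the face of highest mouth activity could be sketched as follows. The per-frame activity values are assumed to come from some mouth activity detector; their scale, the face identifiers, and the use of a simple mean over the segment are assumptions made for the example.

```python
from typing import Dict, List


def most_active_face(mouth_activity: Dict[str, List[float]],
                     start: int, end: int) -> str:
    """Return the face whose mean mouth activity over frames [start, end] is highest."""
    if end < start:
        raise ValueError("empty speech segment")

    def mean_activity(face_id: str) -> float:
        window = mouth_activity[face_id][start:end + 1]
        return sum(window) / len(window) if window else 0.0

    return max(mouth_activity, key=mean_activity)
```

For a two-face shot, each speech segment from the audio segmentation would be passed through such a function, and the audio characteristic data of that segment attached to the returned face.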



The present invention teaches how to incorporate the extracted information from a face
segment into a face database, providing links between similar face segments and merging the
information. The linking and information-merging stage, as part of the face indexing process,
is depicted in Figure 12. If the face database is empty (1210), the detected face segment
initializes the database, providing its first entry. Otherwise, distances are calculated between
the new face segment characteristics and each face entry in the database (1220). An overall
distance measure is calculated as a function of the individual distance components, in one
embodiment being the weighted sum of the distances. Distance measures are ranked in
increasing order (1230). The smallest distance is compared to a similarity threshold
parameter (1240) to categorize the entry as a new face to be entered to the database, or as an
already existing face, in which case the information is merged to an existing entry in the
database.
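
The linking stage can be sketched as follows. The sketch assumes each face segment is summarized by named feature vectors, uses Euclidean distance as a stand-in for each individual distance component, and takes the distance to a database entry as the minimum over the segments already merged into it; these choices, and all names, are illustrative assumptions.

```python
from typing import Dict, List, Tuple

Features = Dict[str, List[float]]   # component name -> feature vector


def component_distance(a: List[float], b: List[float]) -> float:
    """Euclidean distance; stands in for any per-component metric."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def overall_distance(a: Features, b: Features, weights: Dict[str, float]) -> float:
    """Weighted sum of the individual distance components."""
    return sum(w * component_distance(a[name], b[name]) for name, w in weights.items())


def link_segment(database: List[List[Features]], segment: Features,
                 weights: Dict[str, float], threshold: float) -> Tuple[int, bool]:
    """Insert `segment`; return (entry index, True if a new entry was created)."""
    if not database:                               # step 1210
        database.append([segment])
        return 0, True
    ranked = sorted(                               # steps 1220 and 1230
        (min(overall_distance(segment, s, weights) for s in entry), i)
        for i, entry in enumerate(database))
    best_distance, best_index = ranked[0]
    if best_distance <= threshold:                 # step 1240
        database[best_index].append(segment)       # merge into the existing entry
        return best_index, False
    database.append([segment])
    return len(database) - 1, True
```

Keeping every merged segment under its entry preserves the multiple views of a person, which is what later allows a query to be compared against all available views.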



The embodiment in the present invention has been restricted to the indexing of facial content
in a video sequence. However, the methods taught by the invention can be readily modified to
include indexing for browsing and searching of other structured objects. As long as a
correspondence can be established for an object by a combination of tracking and matching
across different video shots, a similar index structure, which consists of Object log, Object
characteristic views and Object Characteristic Data, can be constructed. Furthermore, these
tracking and matching steps can be used to automate the insertion of a new object into the
database.

As a single example consider matching man-made objects such as buildings. Suppose that we
want to recognize the White House in a video program on the president. The program will
include several video scenes in which the White House is visible. Each such scene includes at
least one object track of the White House. Since feature points such as corners characterize
man-made objects, such feature points can be used to track the objects from frame to frame.
Additionally, by matching sets of feature-points across video scenes, correspondence between
views can be established across video shot boundaries. Given tracks of objects,
Object_Characteristic_Views are selected and Object_Characteristic_Data are computed as
taught by the present invention. Once the object index has been constructed it can be used to
browse and search similar objects in the index rather than in the raw video.

We claim:

1. A method for creating a visual index of persons from a video image sequence comprising
of:

Detecting at least one face in a video frame, and

Testing at least one face video frame for depicting the same person in at least another
face instance in said video image sequence;

2. A method for creating a visual index of persons from a video image sequence comprising
of:

Detecting at least one face in a video frame, and

Tracking said detected face in said image sequence;

3. A method as in 1 or in 2 where said index includes video frame number data for at least
one face in the index

4. A method as in 1 or in 2 where said index includes video frame location data for at least
one face in the index

5. A method of selecting face characteristic views from a video image sequence comprising
of:

Creating an index of faces from a video image sequence, and

Selecting visually distinct video frames from the index, which depict at least one person.

6. A method of selecting face frontal views from a video image sequence comprising of:

Creating an index of faces from a video image sequence, and

Selecting visually distinct video frames from the index, which depict at least one person
such that the face pose in each of these frames is frontal.

7. A method of recognizing faces in video comprising of:

Creating an index of faces from a video image sequence, and

Matching a set of at least one query faces against said index

8. A method as in 5 and 6 where said matching is done against said face characteristic views.

9. A method as in 5 where said visually distinct video frames differ at least by head
orientation.

10. A method as in 5 or in 6 where said visually distinct video frames differ at least by mouth
appearance.

11. A method as in 5 or in 6 where said visually distinct video frames differ at least by eyes



12. A method of annotating face content in a video image sequence comprising of:

Creating an index of faces from a video image sequence, and

Attaching annotation data to the index.

13. A method for creating an audio-visual index of persons from a video image sequence
comprising of:

Creating an index of faces from a video image sequence, and

Matching audio characteristic data to said index.

14. A method as in 13 where said matching comprises of comparing speech time segments
with face instance time segment

15. A method as in 14 where said matching includes also detecting mouth activity for at least
one face instance.

16. A method as in 14 where said matching includes also matching speech movements to
mouth movement for at least one face instance.

17. A method as in 1 where testing for depicting the same person comprises also of comparing
audio characteristic data.